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Abstract 

Deep neural network architectures have recently produced excellent results in a variety of areas in artificial 
intelligence and visual recognition, well surpassing traditional shallow architectures trained using hand-designed 
features. The power of deep networks stems both from their ability to perform local computations followed by 
pointwise non-linearities over increasingly larger receptive fields, and from the simplicity and scalability of the 
gradient-descent training procedure based on backpropagation. An open problem is the inclusion of layers that 
perform global, structured matrix computations like segmentation (e.g. normalized cuts) or higher-order pooling 
(e.g. log-tangent space metrics defined over the manifold of symmetric positive definite matrices) while preserving 
the validity and efficiency of an end-to-end deep training framework. In this paper we propose a sound mathematical 
apparatus to formally integrate global structured computation into deep computation architectures. At the heart of 
our methodology is the development of the theory and practice of backpropagation that generalizes to the calculus 
of adjoint matrix variations. The proposed matrix backpropagation methodology applies broadly to a variety of 
problems in machine learning or computational perception. Here we illustrate it by performing visual segmentation 
experiments using the BSDS and MSCOCO benchmarks, where we show that deep networks relying on second-order 
pooling and normalized cuts layers, trained end-to-end using matrix backpropagation, outperform counterparts that 
do not take advantage of such global layers. 


1 Introduction 

Recently, the end-to-end learning of deep architectures using stochastic gradient descent, based on very large datasets, 
has produced impressive results in realistic settings, for a variety of computer vision and machine learning domains[2, 
3, 4, 5]. There is now a renewed enthusiasm of creating integrated, automatic models that can handle the diverse tasks 
associated with an able perceiving system. 

One of the most widely used architecture is the convolutional network (ConvNet) [6, 2], a deep processing model 
based on the composition of convolution and pooling with pointwise nonlinearities for efficient classification and 
learning. While ConvNets are sufficiently expressive for classification tasks, a comprehensive, deep architecture, that 
uniformly covers the types of structured non-linearities required for other calculations has not yet been established. 
In turn, matrix factorization plays a central role in classical (shallow) algorithms for many different computer vision 
and machine learning problems, such as image segmentation [7], feature extraction, descriptor design [8, 9], structure 
from motion [10], camera calibration [11], and dimensionality reduction [12, 13], among others. Singular value 
decomposition (SVD) in particular, is extremely popular because of its ability to efficiently produce global solutions 
to various problems. 

In this paper we propose to enrich the dictionary of deep networks with layer generalizations and fundamental 
matrix function computational blocks that have proved successful and flexible over years in vision and learning mod¬ 
els with global constraints. We consider layers which are explicitly structure-aware in the sense that they preserve 
global invariants of the underlying problem. Our paper makes two main mathematical contributions. The first shows 
how to operate with structured layers when learning a deep network. For this purpose we outline a matrix general¬ 
ization of backpropagation that offers a rigorous, formal treatment of global properties. Our second contribution is to 
further derive and instantiate the methodology to learn convolutional networks for two different and very successful 
types of structured layers: 1) second-order pooling [9] and 2) normalized cuts [7]. An illustration of the resulting 
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Figure 1; Overview of the Deep02P recognition architecture made possible by our methodology. The levels 1.. .1 
represent standard convolutional layers. Layer Z + 1 is the global matrix logarithm layer presented in the paper. This 
is followed by fully connected layers and a logistic loss. The methodology presented in the paper enables analytic 
computation over both local and global layers, in a system that remains trainable end-to-end, for all its local and 
global parameters, using matrix variation generalizations entitled matrix backpropagation. 


deep architecture for O 2 P is given in hg. 1. In challenging datasets like BSDS and MSCOCO, we experimentally 
demonstrate the feasibility and added value of these two types of networks over counterparts that are not using global 
computational layers. 

2 Related Work 

Our work relates to both the extensive literature in the area of (deep) neural networks (see [5] for a review) and with 
(shallow) architectures that have been proven popular and successful in machine learning and computer vision[ i , 14, 
15, 16, 9]. While deep neural networks models have focused, traditionally, on generality and scalability, the shallow 
computer vision and machine learning architectures have often been designed with global computation and structure 
modeling in mind. Our objective in this work is to provide first steps and one possible approach towards formally 
marrying these two lines of work. 

Neural networks in their modern realization can be traced back at least to [17]. The Perceptron [18] was the 
first two layer network, although limited in expressiveness. The derivation of backpropagation[19] and its further 
advances more than a decade later [20, 21], allowed the development and the integration of new layers and the explo¬ 
ration of complex, more expressive architectures. This process lead to a successes in practical applications, e.g. for 
digit recognition [6]. More recently, the availability of hardware, the large scale datasets [2], and the development of 
complex enough architectures, lead to models that currently outperform all existing representations for challenging, 
general recognition problems. This recommends neural networks as one of the forefront methodologies for building 
representations for prediction problems in computer vision and beyond[22]. [3] then showed that even more complex, 
deeper models can obtain even better results. This lead computer vision researchers to focus on transferring this 
success to the detection and semantic segmentation problems, helds where handcrafted features[23, 24], statistically 
inspired[25, 26, 9] and deformable part models[27] were dominant at the time. R-CNN [28] uses standard networks 
(e.g. AlexNet [2] or VGG-16 [3]) to classify object proposals for detection. SDS [29] uses two input streams, one the 
original image and the second the image with the background of the region masked each with AlexNet architectures 
to take advantage of the shape information provided by the mask. He et al. [30, 31 ] propose a global spatial pyramid 
pooling layer before the fully connected layers, which perform simple max-pooling over pyramid-structured cells of 
the image. [32] uses committees to improve robustness and pushed performance close to, or beyond, human perfor¬ 
mance on tasks like traffic sign recognition and house number identihcation. In our hrst application we illustrate a 
deep architecture with a new log-covariance pooling layer that proved dominant for free-form region description [9], 
on top of manually designed local features such as SIFT. The methodology we propose makes it possible to deal with 
the difficulties of learning the underlying features even in the presence such a complex intermediate representation. 
This part is also related to kernel learning approaches over the manifold of positive-dehnite matrices [33]. How¬ 
ever, we introduce different mathematical techniques related to matrix backpropagation, which has the advantages of 
scalability and htting together perfectly with existing deep network layers. 

Among the first methods integrating structured models with CNNs is the work of [34] who showed that HMMs 
can be integrated into deep networks and showed results for speech and text analysis problems. [35] more recently 
demonstrated that using CRFs and deep networks can be trained end-to-end, showing strong results on digit recogni¬ 
tion and protein secondary structure prediction. Cast as a conditional random field (CRF) semantic segmentation has 
almost immediately taken advantage of the deep network revolution by providing useful smoothing on top of high- 
performing CNN pixel classiher predictions. [36] showed that the fully connected components, usually discarded by 
previous methods, can also be made convolutional, i.e. the original resolution lost during pooling operations can be 
recovered by means a trained deconvolution layer. [37] obtained state-of-the-art semantic segmentation results using 
an architecture similar to [36] but enforcing structure using globally connected CRFs[38] where only the unary poten¬ 
tials are learnt. Simultaneous work by [39] and [40] show that, since CRF mean held based approximate updates are 
differentiable, a hxed number of inference steps can be unrolled, the loss can be applied to them and then the gradients 
can be backpropagated back hrst through the inference to the convolutional layers of the potentials. In [41] a more 
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efficient learning method is obtained by blending inference and training in order to obtain a procedure that updates 
parameters as inference progresses. Unlike previous methods [42] learns CNN based pairwise potentials, separate 
from the CNN of the unary potential. Learning the model requires piece-wise training and minimizes an upper-bound 
on the CRF energy that decouples the potentials. 

Our matrix backpropagation methodology generally applies to models that can be expressed as composed struc¬ 
tured non-linear matrix functions. As such, it can be applied to these deep models with a CRP top structure as well 
where e.g. belief propagation in models with Gaussian potentials can be expressed as a solution to a linear system[43]. 
While CRF-based methods designed on top of deep nets traditionally focus on iterative inference and learning where 
in order to construct the derivatives of the hnal layer, one must combine the derivatives of each inference iterations, 
our methodology can be expressed in terms of invariants on the converged solution of linear systems - therefore it 
does not require iterative derivative calculations during inference. 

Our second model used to illustrate the matrix backpropagation methodology, normalized cuts, has received less 
attention from the deep network community as evidenced by the fact that leading methods are still handcrafted. 
Spectral formulations like normalized cMfi(NCuts) [7] have obtained state-of-the-art results when used with strong 
pixel-level classifiers on top of hand-designed features [44]. A different approach is taken in [45] who show that 
MRF inference can be relaxed to a spectral problem. Turaga et al [46] were the hrst to demonstrate the learning of 
an image segmentation model end-to-end using CNN features, while optimizing a standard segmentation criterion. 
Learning and inference of NCuts was placed on hrmer ground by Bach and Jordan [14] who introduced a (shallow) 
learning formulation which we build upon in this work with several important differences. First, it uses matrix 
derivatives, but makes appeal directly to the eigen-decompostion to derive them instead of projectors as we do. This 
allows them to truncate the spectrum and to consider only the eigenspace corresponding to the largest eigenvalues at 
the cost of (potentially) making the criterion non-differentiable. We instead consider the entire eigenspace and rely 
on projectors (thus on the eigen-decomposition only indirectly) and aim to learn the dimensionality in the process. 
More importantly however, instead of learning parameters on top of fixed features as in [14], we directly learn the 
affinity matrix by adapting the underlying feature representation, modeled as a deep network. The resulting method, 
combining strong pixel-level classihers and a global (spectral) representation, can more naturally adapt pixel-level or 
semi-local predictions for object detection and semantic segmentation, as these operations require not only structured, 
global computations, but also, for consistency, propagation of the information in the image. Careful application of 
our methodology keeps the entire architecture trainable end-to-end. From another direction, in an effort to generalize 
convolutions to general non-Euclidean and non-equally spaced grids the work of [47] realizes the necessity of spectral 
layers for learning the graph structure but since the computational issues brought on in the process are not the main 
focus, they do not handle them directly. In [48] such aspects are partially addressed but the authors focus on learning 
parameters applied to the eigenvalues instead of learning the eigenvectors and eigenvalues as we do. In this context 
our focus is on the underlying theory of backpropagation when handling structured objects like matrices, allowing 
one to derive those and many other similar, but also more complex derivatives. 

Symbolic matrix partial derivatives, one of the basis of our work, were hrst systematically studied in the seminal 
paper [49]', although not for complex non-linear layers like SVD or eigen-decomposition. Since then it has received 
interest mainly in the context of studying estimators in statistics and econometrics [51]. Recently, the held of auto¬ 
matic differentiation has also shown interest in this theory when considering matrix functions [52]. This very powerful 
machinery has however appeared only scarcely in computer vision and machine learning. Some instances, although 
not treating the general case, and focusing on the subset of the elements (variables) of interest, appeared in the context 
of camera calibration[53], for learning parameters in a normalized cuts model[14], learning the parameters of Gaus¬ 
sian CRFs for denoising [43] and learning deep canonical correlation models [54]. The recent surge of interest in deep 
networks has exposed limitations of current compositional (layered) architectures based on local operations, which 
in turn pushes the research in the direction of structured models requiring matrix based representations. Recently 
[55] multiplied the outputs of two networks as matrices, in order to obtain improved fine-grained recognition models, 
although the matrix derivatives in those case are straightforward. To our knowledge, we are the hrst to bring this 
methodology, in its full generality, to the fore in the context of composed global non-linear matrix functions and deep 
networks, and to show promising results for two different computer vision and machine learning models. 

Our methodological contributions are as follows: (a) the formulation of matrix back-propagation as a generalized 
chain rule mechanism for computing derivatives of composed matrix functions with respect to matrix inputs (rather 
than scalar inputs, as in standard back-propagation), by relying on the theory of adjoint matrix variations; (b) the 
introduction of spectral and non-linear (global structured) layers like SVD and eigen-decomposition which allow the 
calculation of derivatives with respect to all the quantities of interest, in particular all the singular values and singular 
vectors or eigen-values and eigen-vectors, (c) the formulation of non-linear matrix function layers that take SVD 
or eigen-decompositions as inputs, with instantiations for second-order pooling models, (d) recipes for computing 
derivatives of matrix projector operations, instantiated for normalized-cuts models, (e) The novel methodology (a)- 
(d) applies broadly and is illustrated for end-to-end visual learning in deep networks with very competitive results. 

’ With few additions from disparate sources, these are the matrix derivatives results catalogued in the collection of matrix identities called “The 
Matrix Cookbook”[50]. 
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Paper organization: In the next section §3 we briefly present the main elements of the current deep network models. 
In §4.2 we outline the challenges and a computational recipe to handle matrix layers. §5 presents a “matrix function” 
layer using either SVD or an EIG decomposition and instantiated these to build deep second-order pooling models. 
In §6, we introduce an in-depth treatment to learn deep normalized cuts models. The experiments are presented in §7. 


3 Deep Processing Networks 


Let T> = jv be a set of data points (e.g. images) and their corresponding desired targets (e.g. class 

labels) drawn from a distribution p(d, y). Let L : M be a loss function i.e. a penalty of mismatch between the 

model prediction function f : —>■ with parameters W for the input d i.e. /(d^), W) and the desired output 

The foundation of many learning approaches, including the ones considered here, is the principle of empiri¬ 
cal risk minimization, which states that under mild conditions, due to concentration of measure, the empirical risk 
R{W) = i L(/(db), lL),yb)) converges to the true risk i?(lL) = / L(/(d, VL), y)p(d, y). This implies 
that it suffices to minimize the empirical risk to learn a function that will do well in general i.e. 


arg mm 

w N 


1 ^ 

-^L(/(dW,W),y«) 


( 1 ) 


If L and / are both continuous (though not necessarily with continuous derivatives) one can use (sub-)gradient descent 
on (1) for learning the parameters. This supports a general and effective framework for learning provided that a (sub-) 
gradient exists. 

Deep networks, as a model, consider a class of functions /, which can be written as a series of successive function 
compositions / = o o ... o with parameter tuple W = (wjc, 'Wk-i, • ■ ■, Wi), where are called 

layers, w; are the parameters of layer I and K is the number of layers. Denote by = Lo f^^'> o ... o /(*) the loss 
as a function of the layer x;_i. This notation is convenient because it conceptually separates the network architecture 
from the layer design. 

Since the computation of the gradient is the only requirement for learning, an important step is the effective 
use of the principle of backpropagation (backprop). Backprop, as described in the literature, is an algorithm to 
efficiently compute the gradient of the loss with respect to the parameters, in the case of layers where the outputs 
can be expressed as vectors of scalars, which in the most general form, can individually be expressed as non-linear 
functions of the input. The algorithm recursively computes gradients with respect to both the inputs to the layers and 
their parameters (fig. 2) by making use of the chain rule. Lor a data tuple (d, y) and a layer I this is computing 


dL(')(xi_i,y) aL('+i)(xi,y) 5/(')(xi_i) 


dvzi 

dLW(xi_i,y) 


dx; dvzi 

5L('-tl)(x;,y) a/(')(Xi_i) 


( 2 ) 


(3) 


dxi-i dxi dx;_i 

where X; = /(^)(x/_i) and Xq = d (data). The first expression is the gradient we seek (required for updating w;) 
whereas the second one is necessary for calculating the gradients in the layers below and updating their parameters. 


4 Structured Layers 

The existing literature concentrates on layers of the form = (/|^^(xi_i),..., (xi_i)), where —>■ 


. This simplifies processing significantly because in order to compute 


there is a well defined notion of partial derivative with respect to the layer 




dxi-i 


as well as a simple expression 


5xi_i 

for the chain rule, as given in (2) and (3). However this formulation processes spatial coordinates independently and 
does not immediately generalize to more complex mathematical objects. 

Consider a matrix view of the (3-dimensional tensor) layer, X = Xi_i, where G K, with i being the spatial 
coordinate^ and j the index of the input feature. Then we can define a non-linearity on the entire X G as a 

matrix, instead of each (group) of spatial coordinate separately. As the matrix derivative with respect to a vector (set 
aside to a matrix) is no longer well-defined, a matrix generalization of backpropation is necessary. 

Lor clarity, one has to draw a distinction between the data structures used to represent the layers and the mathemat¬ 
ical and computational operations performed. Lor example a convolutional neural network layer can be viewed, under 
the current implementations, as a tensor where two dimensions correspond to spatial coordinates and one dimension 
corresponds to features. However, all mathematical calculations at the level of layers (including forward processing 


^For simplicity and without loss of generality we reshape (linearize) the tensor’s spatial indices to one dimension with mi coordinates. 
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Figure 2: Deep architecture where data x and targets y are fed to a loss function L, via successively composed 
functions with parameters w/. Backpropagation (blue arrows) recursively expresses the partial derivative of the 
loss L w.r.t. the current layer parameters based on the partial derivatives of the next layer, cfeq. 2. 


or derivative calculations) are not expressed on tensors. Instead these are performed on vectors and their scalar out¬ 
puts are used to selectively index and hll the tensor data structures. In contrast, a genuine matrix calculus would not 
just rely on matrices as data structures, but use them as hrst class objects. This would require a coherent formulation 
where non-linear structured operations like forward propagation or derivative calculations are directly expressed using 
matrices. The distinction is not stylistic, as complex matrix operations for e.g. SVD or eigen-decomposition and their 
derivatives simply cannot be implemented as index-filling. 


4.1 Computer Vision Models 

To motivate the use of structured layers we will consider the following two models from computer vision: 

1. Second-Order Pooling is one of the competitive hand-designed feature descriptors for regions [9] used in the 

top-performing method of the PASCAL VOC semantic segmentation, comp. 5 track [56, 57]. It represents 
global high-order statistics of local descriptors inside each region by computing a covariance matrix F, 
where F G is the matrix of image features present in the region at the m spatial locations with d feature 

dimensions, then applying a tangent space mapping [58] using the matrix logarithm, which can be computed 
using SVD. Instead of pooling over hand-designed local descriptors, such as SIFT [59], one could learn a deep 
feature extractor (e.g. ConvNet) end-to-end, with an upper second-order pooling structured layer of the form 

C = log(F^F + eI) (4) 

where e/ is a regularizer preventing log singularities around 0 when the covariance matrix is not full rank. 

2. Normalized Cuts is an influential global image segmentation method based on pairwise similarities [7]. It 

constructs a matrix of local interactions W = FF^, where F G is a similar feature matrix with m 

spatial locations and d dimensions in the descriptor, then solves a generalized eigenvalue problem to determine 
a global image partitioning. Instead of manually designed affinities, one could, given a ground truth target 
segmentation, learn end-to-end the deep features that produce good normalized cuts. 


4.2 Matrix Backpropagation 

We call matrix backpropagation (MBP) the use of matrix calculus [49, 51, 52] to map between the partial derivatives 
—-and -at two consecutive structured layers. Note that since for all layers I the function maps to real 

OX; OX;_i 

numbers, by construction, both derivatives are well defined. In this section we simplify notation writing L = 

X, Y are the matrix versions of x; and x;_i respectively, / = thus f(X) = Y. 

The basis for the derivation is best understood starting from the Taylor expansion of the matrix functions[51] at 
the two layers 


Lof{XFdX)-Lo f{X) = : dX + 0(||dXf) (5) 

BL 

L(Y + dY)-LiY) = — :dY + 0{\\dYf) (6) 

where we introduced the notation A : B = Tr(A^i?) = vec(A)^ vec{B) for convenience. Thus A : B is an inner 
product in the Euclidean vec’d matrix space. 

Our strategy of derivation, outlined below, involves two important concepts. A variation corresponds to the 
forward sensitivity and allows the easy manipulation of the first and higher order terms of a Taylor expansion. E.g. 
for a matrix function g we write dg = dg{X; dX) = g{X dX) — g{X) = A{X) : dX -f 0(||dAr|p), with A(X) a 
matrix of the same size as X and depending on X but not on dX. The (partial) derivative is by definition the linear 
‘coefficient’ of a Taylor expansion i.e. the coefficient of dX ergo ^ = A(X). The variation and the partial derivative 
are very different objects: dg is always defined if g is defined, it can take matrix inputs, and can map to a space of 
matrices. In contrast, the partial derivative also makes sense when g has matrix inputs, but is only defined when g has 
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scalar co-domain (image)^. The variation is used for the convenience of the derivation and needs not be implemented 
in practice. What we are ultimately after, for the purpose of matrix backpropagation, is the partial derivative. 

The important element to understand is that when 

dY = df{X-dX) (7) 


the expressions (5) and (6) should be equal, since they both represent the variation of L for a given perturbation dX 
of the variable X. The first order terms of the Taylor expansions should also match, which gives us the chain rule 


dLo f 
dX 


:<i.Y = §:dy 


( 8 ) 


The aim is to use this identity to express the partial derivative of the left hand side as a function of the partial derivative 
in the right hand side. The general recipe for our derivation follows two steps^: 

1. Derive the functional C describing the variations of the upper layer variables with respect to the variations of 
the lower layer variables 

dY = C{dX) = df{X-, dX) (9) 

The derivation of the variation involves not only the forward mapping of the layer, but also the invariants 
associated to its variables. If X satisfies certain invariants, these need to be preserved to first (leading) order 
when computing dX. Invariants such as diagonality, symmetry, or orthogonality need to be explicitly enforced 
in our methodology, by means of additional equations (constraints) beyond (9). 

2. Given dY produced in step 1 above, we know that (8) holds. Thus we can use the properties of the matrix inner 
product A ■. B = Tr(A^i?) to obtain the partial derivatives with respect to the lower layer variables. Since the 

operator is an inner product on the space of matrices, this is equivalent to constructively producing £*, a 
non-linear adjoint operator^ of C 


dL 

W 


Fit 

■■dY=^: C{dX) 4 C* 



■.dX^C* 



dLof 

dX 


(by the chain rule) (10) 


This holds for a general variation, e.g. for a non-symmetric dX even if X itself is symmetric. To remain within 
a subspace like the one of symmetric, diagonal or orthogonal matrices, we consider a projection of dX onto 
the space of admissible variations and then transfer the projection onto the derivative, to obtain the projected 
gradient. We use this technique repeatedly in our derivations. 

Summarizing, the objective of our calculation is to obtain ■ Specifically, we will compute ^ (typically back- 
propagated from the layer above) and dY = C{dX), then process the resulting expression using matrix identities, in 
order to obtain an analytic expression for : C{dX). In turn, extracting the inner product terms (^) : dX 
from that expression, allows us to compute C*. 


5 Spectral and Non-Linear Layers 

When global matrix operations are used in deep networks, they compound with other processing layers performed 
along the way. Such steps are architecture specific, although calculations like spectral decomposition are widespread, 
and central, in many vision and machine learning models. SVD possesses a powerful structure that allows one to 
express complex transformations like matrix functions and algorithms in a numerically stable form. In the sequel we 
show how the widely useful singular value decomposition (SVD) and the symmetric eigenvalue problem (EIG) can 
be leveraged towards constructing layers that perform global calculations in deep networks. 


5.1 Spectral Layers 


The SVD layer receives a matrix X as input and produces a tuple of 3 matrices 11,11 and V. Under the notation 
above, this means Y = f{X) = {U,T,,V). The matrices satisfy the regular invariants of the SVD decomposition 
i.e. X = UYV^, U^U = I, V^V = I and E is diagonal which will be taken into account in the derivation. The 
following proposition gives the variations of the outputs i.e. C{dX) = dY = {dU, dY, dV) and the partial derivative 


with respect to the layer 


dLof 

dX 


as a function of the partial derivatives of the outputs 


dL 

dY 


dL dL . dL 

“■ m-Ss ““ W- 


that these correspond, respectively, to the first and second step of the methodology presented in §4.2. In the sequel, 
we denote Asym = ^ {A~^ + A) and Adiag be A with all off-diagonal elements set to 0. 


^See [51] for an in-depth treatment including questions about existence and uniqueness of the partial derivatives of matrix functions. 
^The appendices contain some of the basic identities used in each of these steps. 

^For arguments a, b, the adjoint operator C* of C is defined as having the property that a : C{b) = £* (a) : b. 
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Proposition 1 (SVD Variations). Let X = UXV^ with X € M™’” and m> n, such that U^U = I, V^V = I and 
S possessing diagonal structure. Then 

( 11 ) 


and 

with 

Let € 


dX = {U^dXV)d^ag 
dV = 2V {K^ o {Y^U^dXV)sym) 


i=j 


( 12 ) 


(13) 


be the top n rows ofS and consider the block decomposition dU = {dUi \ dU- 2 ^ with dUi G 


and dU 2 G 


pmxm—n 


and similarly 


dL 

W 


aL\ 

W), 


dL 

w 


, where 


dL 

w 


and 


dL 

m 


^mxm—n 


. Then 


with 

Consequently the partial derivatives are 

'dL 


dU = (CE-i I - C/iS-iC^C/ 2 ) 

c = dxv - udx - umv^v 


dLof -r 

= DV^ + U 

dX 


9E 


- U^D 


V ' + 2C/S K' o V 


where 


diag 


dL 


dL 

dV 


- VD^UE 


V' 


dU 


D= ^u-U2 ^, U,X 


dL\ 


dUj. 


(14) 

(15) 

(16) 

(17) 


Proof. Let X = UY.V^ by way of SVD, with X e and m > n, E e diagonal and U G 

V G orthogonal. For a given variation dX of X, we want to calculate the variations dU,dX and dV . The 

variation dX is diagonal, like S, whereas dU and dV satisfy (by orthogonality) the constraints dU + dU^U = 0 
and dV + dV^V = 0 respectively. Taking the first variation of the SVD decomposition, we have 


dX = dUEV^ + UdXV^ + 


(18) 


then, by using the orthogonality of {7 and V, it follows that 


^U^dXV = U^dUY. + dX + YdV^V ^ 

=^R = AY + dY + YB 

with R = dXV and A = U^dU, B = dV^V both antisymmetric. Since dY is diagonal whereas AY, YB 
have both zero diagonal, we conclude that 

dY = {U^dXV)d^ag (19) 

The off-diagonal part then satisfies 

AY + YB = R — Rdiag E^TlE + Y^YB = S^(i? — Rdiag) = 

f ^i^ij^j “t“ ^ij 


-ajaijdi — a^bij = crjRji (A,B antisym.) 


(fj^ (7j ^bij — UiRij T RjiCj b^j — 


^ + Rji<Jj) , 


( 20 ) 


^0, i = j 

where = Yu and R = R — Rdiag- We can write this as B = K o (YA R + R^ Y) = K o (YA R + RA Y), where 


1 


K,,n = < crt- < 
0 , 


r. iAj 


( 21 ) 


i=j 


Finally, 

dV = VB^ =>dV = 2V (K^ o {Y^U^dXV)sym) (22) 

Note that this satisfies the condition V^dV + dV^V = 0 by construction, and so preserves the orthogonality of V to 
leading order. 
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Using the dE and dV we have obtained, one can obtain dU from the variations of dX in (18): 

dX = dUYX^ + UdXV^ + UHdV^ ^ dUH = dXV - UdY. - UT.dV^V =: C 

This equation admits any solution of the block form dU = [dUi dU 2 ), where dUi := CT,~^ € ]^mxn being the 
top n rows of S) and dU 2 G arbitrary as introduced in the proposition. To determine dU 2 uniquely we turn 

to the orthogonality condition 


dU' U + U' dU = Q 


(dUj Ui + Uj dUi dU^ U 2 + dU2 
\dUj Ui + Uj dUi dUj U 2 + Uj dU2 


= 0 


The block dUi already satishes the top left equation, so we look at the top right (which is equivalent to bottom left). 
Noting that Ui Ui — Ihy the orthogonality of U, we can verify that dU 2 = —UidUjU 2 - Since this also satishes the 
remaining equation, orthogonality is satished. Summarizing 


dU = (CE-^ I - UiY.-^C^U2) , C = dXV - UdX - UY.dV^V 


(23) 


dLof 

We proceed further with the second part of the matrix backpropagation to compute — „,, . Note that the chain 


rule in this case is —- : dX = —— : dU 

oX oU 

expressions w.r.t. dX to obtain the partial derivatives. 

Before computing the full derivatives we simplify slightly the expression corresponding to dU 


dX 

dL dL 

: dYj + TT— : dV. We can replace dU, dV and dE with their 
aE oV 


dL 


dU 



dL 


LX,dU={^\ : CE-^ + ( ^ 1 : -Ui^-^C^U2 


dU 

E-i:C-E-^[/i' 


2 

-IrrT 


E-i :C-C/2 


- U 2 


dL 

W 


dL 

W 

T 


c/ 2 ' :C' 


C/iE-i : C 


dlX\ 

Wj 


) C/iS 
2 


-1 


: (dXV - Ud^ - UEdV^V) 


= D : dXV - D : C/dE - D : UmV^V 
= DV^ : dX - U^D : dE - EC/^DU^ : dU^ 

= DV^ : dX - U^D : dE - VD^UT, : dV 

Now we can plug this result in the full derivative 

dr 8T dr dr 

^ : dC/ + — : dE + — : dU ={DV^ : dX - D : dE - VD^UE : dV) + — : {U^dXV)d^ag+ 

dr 

+ —:{2V {K^ o {E^U^dXV)sym )} 


(24) 

(25) 

(26) 

(27) 

(28) 

(29) 

(30) 


--DV^ ■dX+(^^- : iU^dXV)d^ag+ 

+ VD^UJ^j : {2V {K^ o dXV)} 

=DV^ -.dX+l^- U^d] : {U^dXV)+ 


diag 


+ 2V^ - VD^UE^ : {K^ o {E^dXV)sym) by (68), (69) 


=DV ' -.dX + Ui^- U^D 


dL 


V ' : dX+ 


diag 


+ 2[K' o[V' [—-VD'UE 


: E^U^dXV 


sym 


=DV' -.dX + Ul^- U^D 


dE 


V ' : dX+ 


diag 


+ 2UE K 


dL 


V' \ —-VD'UE 


: dX 


by (70),(71) 


by (68) 


sym 
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dL o f 

and so, since the last expression is equal to —- : dX by the chain rule, 

aX 

^ (i - ° {w - 


V 


sym 


(31) 


□ 


The EIG is a layer that receives a matrix X as input and produces a pair of matrices U and S. Given our 
notation, this means Y = f{X) = (17, S). The matrices satisfy the regular invariants of the eigen-decomposition 
i.e. X = UTjU^, U^U = I and S is a diagonal matrix. The following proposition identifies the variations of the 

outputs i.e. C{dX) = dY = {dU, dS) and the partial derivative with respect to this layer — _ ^ as a function of the 


^ u dL . dL dL 

partial derivatives of the outputs —— i.e. —— and . 

dY dU dY 


dX 


Proposition 2 (EIG Variations). Let X = UYU^ with X G K™’"*, such that U^U = I and E possessing diagonal 


structure. Then 

and 

with 

The resulting partial derivatives are 

dLo f 


dY = {U^dXU)d^ag 
dU = U {K^ o (U^dXU)) 

, 3 

i=j 


K,j ^ Uj 


(32) 

(33) 

(34) 




u' 


(35) 


diag 


Proof. Eirst note that (19) still holds and with the notation above we have in this case m = n, U = V. This implies 

dY = iU^dXU)d^ag (36) 

Furthermore we have A = {A, B antisymmetric) and the off-diagonal part then satisfies AY + YA^ = 

R — Rdiag- In a similar process with the asymmetric case, we have 


AY -\- YA — R — Rdiag 
so that A = o R with 

From this, we get then 


=> AY — YA = i? 


, i¥= 3 


tXijlTi — Rij ^ tj f J 
Qij =0, i = j 


Ku = { (T^- cr,- 


* =3 


dU = U o (U^dXU)'^ 


(37) 


(38) 


dL o f dL dL 

Note that the chain rule in this case is —-- : dX = —— : dLf + - 7 — : dY, we can replace dU and dY with 

dX dLf dY 

their expressions w.r.t. dX to obtain the partial derivatives 




:dU+^:dY=^:{u[K^o (U^dXU)) } + ^ : (U^dXU) 


dL 


dL 


diag 




and so 




d_L\\^(d_L\ 


U' 


9^Jd^ag) 

Note that (19), (38), (37) and (39) represent the desired quantities of the proposition. 


(39) 

□ 
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5.2 Non-Linear Layers 


Using the SVD and EIG layers presented above we are now ready to produce layers like O 2 P that involve matrix 
functions g, e.g. g = log, but that are learned end-to-end. To see how, consider the SVD of some deep feature 
matrix F = and notice that g{F^F + el) = g{VTJUT,V^ + eVV^) = Vg{T,^T, -f eI)V^, where 

the last equality is obtained from the definition of matrix functions given that Schur decomposition and the eigen- 
decomposition coincide for real symmetric matrices [60], Thus to implement the matrix function, we can create a new 
layer that receives the outputs of the SVD layer and produces Vg^FJ 11+eI)V^, where g is now applied element-wise 
to the diagonal elements of S^S + el thus is much easier to handle. 

An SVD matrix function layer receives as input a tuple of 3 matrices U,F and V and produces the response 
C = Vg{F^F + eI)V^, where g is an analytic function and e is a parameter that we consider fixed for simplicity. 
With the notation in section §4.2 we have X — {U,F,V) and Y = f{X) = Vg{F^F + eI)V^. The following 
proposition gives the variations of the outputs are i.e. C{dX) = dY = dC and the partial derivatives with respect 
dL o / . f dL o f dL o f\ ^ .... . dL 


to this layer 


dX 


i.e. 


dV ’ dF 


as a function of the partial derivatives of the outputs -j—. Note that the 

ui-y 

dL 


layer does not depend on U so its derivative —— = 0. 

oU 

Proposition 3 (SVD matrix function). An (analytic) matrix function of a diagonalizable matrix A = VFV^ can 
be written as g{A) = Vg{F)V^. Since F is diagonal this is equivalent to applying g element-wise to F’s diagonal 
elements. Combining this idea with the SVD decomposition F = UFV^, our matrix function can be written as 
C = g{F^F + el) = Vg{F^F + eI)V^. 

Then the variations are 


dC = 2 {dV g{F^F + eI)V^)^ ^ + 2 {VgfF^F -f eI)F^dFV^) 


and the partial derivatives are 


and 


\ / sym 




V 


(40) 


(41) 


Proof Using the fact that for a positive diagonal matrix A and a diagonal variation dA, g(^A + dA) = g{A) 
g'{A)dA + 0(||dA|p), we can write 


dC = 2 {dVg{F^F + eI)V)+ 2 {Vg'{F^F + eI)F^dFV^) 


The total variation dL of an expression of the form L = g{C), g :. 


sym 

can then be written as: 


dL 

W 


+ 2 (r9'(ETE + 


'dL" 


: {dVg{F^F + eI)V^) + 2 


'dL 

dC 


: {Vg'{F'F + eI)F' dFV') 




by (70) 
by (68) 


By the chain rule, we must have 


dL 

W 


dC = 


dLo f 
dV 


dV + 


dLof 

dF 


dF 


dLof _ n ( dL\ 
dV ~ ^ \ del sym 


Vg{F^F + eI) 


^ = 2Fg'iF-^F + eI)V^ 


(42) 


□ 


Similarly the EIG matrix function layer receives as input a pair of matrices U and Q and produces the response 
C = Ug{Q)U^. With the notation from §4.2 we have X — {U,Q) and Y — f{X) — Ug{Q)U^. Note that if 
the inputs obey the invariants of the EIG decomposition of some real symmetric matrix Z = UQU^ i.e. U are the 
eigenvectors and Q the eigenvalues, then the layer produces the result of the matrix function C = g{Z). This holds 
for similar reasons as above g{Z) = g{UQU^) = Ug{Q)U^, since in this case the Schur decomposition coincides 
with the eigen-decomposition [60]. The following proposition shows what the variations of the outputs are in this case 


i.e. C{dX) = dY = dC and what the partial derivatives with respect to this layer are ^ i.e. 


dX 


as a function of the partial derivatives of the outputs . 


dLof dLof 


dU 


dQ 
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Proposition 4 (EIG matrix function). Let Z = UQU^ by way of eigen-decomposition (symmetric SVD), with Z £ 
5'+(to) an m X m real symmetric matrix. Then Q £ diagonal (the strictly positive eigenvalues) and 

U £ is orthogonal (the eigenvectors). Denote with C = g{Z) = Ug{Q)U^ Then the variations of C are 

given by 

dC = 2{dUg{Q)U^)sym + Ug'{Q)dQU^ (43) 

and the partial derivatives are 

(44) 

(45) 

Proof. The variation of C is 


dLof 

dU 


= 2 


dL 

dC 


Ug{Q) 


sym 




dC = dUg{Q)U^ + Udg{Q)U^ + Ug{Q)dU^ ^ dC = 2{dU g{Q)U^) + U g'{Q)dQU^ 


We consider the variation of L, 


dL 


dL 


: rfC = ^ : {2{dUg{Q)U^)sym + Ug'{Q)dQU^] 


dL 

= g\Q)U^ — U-.dQ + 2 


dL 


Ug{Q) : dU 


sym 


By the chain rule, we must have 


— ■ ^ 
dC ■ dU 


: dU + 


dLof 

dQ 


: dQ 


dLof 

dU 

dLof 

dQ 



= g'{Q)U^ 


dC 


Ug{Q) 

U 


□ 


Now it is trivial to derive two versions of the O 2 P descriptor (4) by plugging in log and its derivative in the 
propositions above. 

Corollary 1 (Deep02P). Deep O 2 P layers can be implemented and have the following backpropagation rules 
1. Deep02P-SVD: 


dLof 

dV 



yiog(E^E + e/) and 

sym 


dLof 

5S 


2E(E^E + e/)-iE^ 



(46) 


2. Deep02P-EIG: 


dLof 

dU 



U\og{Q) and 

sym 


dLof 

dQ 


= Q-^ 




(47) 


Proof. If g{A) = log(2l) then g'{A) = A Plugging these into (40) and (41) we obtain the Deep02P-SVD 
derivatives above. Similarly, plugging into (44) and (45) gives the Deep02P-EIG derivatives. □ 


6 Normalized Cuts Layers 

A central computer vision and machine problem is grouping, segmentation or clustering, i.e. discovering which 
datapoints (or pixels) belong to one of several partitions. A successful approach to clustering is normalized cuts. Let 
TO be the number of pixels in the image and let E = to} be the set of indices. We want to compute a partition 

V = {Pi, ..., Pfc}, where k = |P|, Pi C V, {J^Pi = V and Pj f} P^ = 0. This is equivalent to producing a 
matrix E £ {0, such that E{i,j) = 1 if i £ Pj and 0 otherwise. Let P £ be a data feature matrix 

with descriptor of size d and let IE be a data similarity matrix with positive entries. Eor simplicity we consider 
W = EAE^, where A is a d x d parameter matrix. Note that one can also apply other global non-linearities on top 
of the segmentation layer, as presented in the previous section. Let D = [VLl], where [v] is the diagonal matrix with 
main diagonal v, i.e. the diagonal elements of D are the sums of the corresponding rows of W. The normalized cuts 
criterion is then 

C(W, E) = Pi[E^WE(,E^DE)-^) ( 48 ) 
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Finding the matrix E that minimizes C{W, E) is equivalent to finding a partition that minimizes the cut energy but 
penalizes unbalanced solutions. 

It is easy to show[14] that C{W, E) = k — where Z is such that; a) Z^Z = I, and b) 

D^I'^Z is piecewise constant with respect to E (i.e. it is equal to E times some scaling for each column). Ignoring the 
second condition we obtain a relaxed problem that can be solved, due to Ky Fan theorem, by an eigen-decomposition 
of 

M = £)-i/2vp£)-i/2 (49) 

[14] propose to learn the parameters A such that D^/^Z is piecewise constant because then, solving the relaxed 
problem is equivalent to the original problem. In [14] the input features were fixed, thus A are the only parameters to 
permit the alignment. This is not our case, as we place a global objective on top of convolutional network inputs. We 
can can therefore take leverage the network parameters in order to change F directly, thus training the bottom layers 
to produce a representation that is appropriate for normalized cuts. 

To obtain a Z that is piecewise constant with respect to D^f^E we can align the span of M with that of 11 = 
D^/^EE^. For this we can use projectors 11^ of the cotTesponding space spanned by A, where 11^ = AA+ 
is an orthogonal projector and A+ is the Moore-Penrose inverse of A. The alignment is achieved by minimizing the 
Frobenius norm of the projectors associated to the the model prediction IIm and the desired output Iln, respectively 

MW,E) = ^\\IlM-Tlnfp (50) 


Notice that while our criterion Ji is superficially similar to the one in [14], there are important differences. [14] 
truncate the spectrum and consider only the eigenspace corresponding to the largest eigenvalues at the cost of (poten¬ 
tially) making the criterion non-differentiable. In contrast, we consider the entire eigenspace and rely on projectors 
(and only indirectly on eigen-decomposition) aiming to also learn the dimensionality of the space in the process. 

We will obtain the partial derivatives of an objective with respect to the matrices it depends on, relying on matrix 
backpropagation. Since the projectors will play a very important role in a number of different places in this section 
we will treat them separately. 

Consider a layer that takes a matrix A as input and produces its corresponding orthogonal projector 11^ = AA~^. 
In the notation of section 4.2, X = A and Y = f{A) = A a- The following proposition gives the variations of the 

outputs i.e. L{dX) = dY = and the partial derivative with respect to the layer —as a function of the 


partial derivatives of the outputs i.e. 


dL 


dX 


Lemma 1. Consider a symmetric matrix A and its orthogonal projection operator 11^. IfdA is a symmetric variation 
of A then 

dAA = 2{{I-AA)dAA+) ) (51) 


and 


A- 




(52) 


sym 


Proof. (We drop the subscript of 11^ for brevity.) Taking the variation of the basic properties of the projector 11^ = If 
and If A = A, we have 


dnn -f ndn = dn (53) 

dAA + AdA = dA (54) 

We then consider the following decomposition of dll 

dA = AMA -f (J - n)Qn -f AQ^{I - A) + {I- A)R{I - H) 

with M and R symmetric, so that dll is symmetric by constmction. Plugging into the equations above, we obtain 

2AMA -f (/ - n)Qn -f AQ^ (/ - n) = dA 
AMA + {I- A)QA = (/ - A)dA 

Comparing the first equation with the decomposition of dll above, we infer that M = R = 0, and so 

(/ - n)Qn -f AQ^{I -A) = dA 

(/ - n)QA = (/ - n)dA 

Multiplying the second equation with A+ at the right hand side gives (/ — IljQII = (/ — n)dAA+. Plugging this 
into the first equation gives the desired result for the variations. 
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Now we can calculate the partial derivative 


dL 


: (Ma 




by (51) 


and so 


dLof 

dA 


= 2{I-IIa) 


( — ) 
[dUAj 


A+ 


sym 


(55) 

□ 


The derivation relies only on basic properties of the projector with respect to itself and its matrix: 11^ = 11^ 
(idempotency of the projector) and IIaA = A (projector leaves the original space unchanged). Note that since 
= AA^, there exists a non-trivial spectral decomposition in training, although it is ‘notationally’ hidden under 
A+, which nevertheless requires an SVD computation. 

From the perspective of matrix backpropagation we split the computation of Ji into the following 4 layers F 
W —>■ {M, n) —> (IIm, Flo) —> Ji- We consider them in reverse order from the objective down to the inputs. First 

dJi dJi 

the derivative of the Frobenius norm is well known[50] so —-= IIm — Ho and —-= Ho — IIm- Then we focus 

oAm oHo 

on the layer taking as inputs M or Q and producing the corresponding projectors i.e. IIm and IIq. These derivatives 
are obtained by applying Lemma 1 . 

Subsequently, we consider the layer receiving {W, E) as inputs and producing (M, ^l). Under the notation intro¬ 
duced in §4.2, L = Ji, X = (W, E) and Y = f{X) = {M, U) as dehned above. The following proposition gives the 

variations of the outputs i.e. C{dX) = dY = {dM, dQ) and the partial derivative with respect to the layer ^ "" 


e ■ f u • 1 j f u dL. f dL dL 

a function of the partial derivatives of the outputs —— i.e. ——, 7 — 

dY \dM dil 

Proposition 5. With the notation above, the variations of M and U are 

dn = (nD-Adwi]) 

\ V sym 

and 

dM = D-^/'^dWD-^/^ - {MD-^[dWl])^ 
and the partial derivative of Ji with respect to W is 

, ,,, («) _ o-H. (|L) 


dX 


(56) 


(57) 


sym > 


Proof. For a diagonal matrix D under a diagonal variation dD, we can show that d{DP) = pD^ ^dD by means of 
element-wise differentiation. For the particular D = [FFl], we have dD = [dlUl]. Using these, we obtain 


dLl = 


and 


^dDD-'^/'^EE^ D^/'^F-D^/'^EE^ D-'^/'^dD = (d^/'^EE^ D-^/‘^dD\ = {nD-^[dWl]) 

2 2 V / sym 


dM = --dDD-^/'^WD-^/^ + D-^/'^dWD-^/'^ - -D-^/'^WD-^/'^dD 
2 2 


sym 


(58) 


= D-^^'^dWD-^l'^ - {MD-\dWl]) 


sym 


Then, plugging in the variations we compute the partial derivative 




dJi 


dM ' dn 

then identifying we obtain 


dJ^ 


dM 


.dL 


dM, 


[dwi] + ( D-^n^ 

C/i L sym 


[dWl] 


^ ^- 1/2 tF/i 

dW dM 


.dJi 


-pdiag 

where we used the property A : [Bx] = Aii[BijXj) = {AiiXj)Bij = (diag(A)a;^) : B. 


-D-^M 

C/iZ sym uAI sym 


□ 
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A related optimization objective also presented in [14] is 

= ( 59 ) 

with 'h = E{E^E)~^. Here we consider H^y = V{V^V)~^V^, where V = D^/'^U. We observe that this is a 
projector for W by noting that Hy/ = D^/'^U{U^DU)~^U^D^/"^ and M = UEU^ = , by eigen 

decomposition and (49). Then indeed 

1. Idempotency of IIvi/ 


DU)-^U^ DU(U^ DU)-^U^ = H^ 

2 . Hw leaves 14^ unchanged 


UwW = 

= D-^/^ = W 


QJ2 

Proposition 6. The corresponding partial derivative —— is 

oW 

f) 7 

^=-2{I-Uw)Il^W+ 

dJ 

Proof. Since 4/ does not depend on W, then = 0, so the derivation is much simpler 


dJ2 

dW 


= -2(1-n 


w) 


f dJ2 


Van 


w 




= -2(1 - nw)(Uw - Il^)W+ 
= -2(1 -Ilw)Tl^W+ 


by Lemma 1 

by Frobenius derivative 
by idempotency of projector 


(60) 


( 61 ) 

(62) 

(63) 


□ 


Finally, in both cases, we consider a layer that receives A and F as inputs and outputs the data similarity 
W = FAF^. Following the procedure of section 4.2, hrst we compute the hrst order variations dW = dFAF^ + 
FdAF^ + FAdF^. We then use the trace properties to make the partial derivatives identihable 




= F 


dJ, 

dW^ 


F’ : dA + 2 


( ^ 
I 9VF 


FA' 


sym 


sym 



Thus we obtain 


and 


^ = pT dA 
dA dW 


dF 



FA^ 

sym 


(64) 

(65) 


dJ 

Note that when J = J 2 then = 0, since (I — I{w)F = F^(I — Hy/) = 0. Thus we cannot learn A 
by relying on our projector trick, but there is no problem learning F, which is our objective, and arguably more 
interesting, anyway. 

An important feature of our formulation is that we do not restrict the rank in training. During alignment, the 
optimization may choose to collapse certain directions thus reducing rank. We prove a topological lemma implying 
that if the Frobenius distance between the projectors (such as in the two objectives Ji, J 2 ) drops below a certain 
value, then the ranks of the two projectors will match. Conversely, if for some reason the ranks cannot converge, the 
objectives are bounded away from zero. The following lemma shows that when the projectors of two matrices A and 
B are close enough in the 11-112 norm, then the matrices have the same rank. 


Lemma 2. Let A,B€ and H^, 11^ their respective orthogonal projectors. If ||n^ — nB ||2 < 1 then 

rank A = rank B. 
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Proof. The spectral norm 11-112 can indeed be defined as \\A \\2 = sup||^||^^g We assume that the ranks of A and 

B are different, i.e. w.l.o.g. rank A > ranki?. By the fundamental theorem of linear algebra there exists a vector v 
in the range of A (so that Aav = v), that is orthogonal to the range of B (so that Abv = 0). We have then 


||n^ — nB||2 > 


IIIIau - IIbuII 

Ikll 


lin^ull 


which is a contradiction. 


□ 


Given that the Frobenius norm controls the spectral norm, i.e. ||A ||2 < PIIf (§2.3.2 of [60]), 
corollary is that when J 2 is bounded above by 1/2, then ||^||2 < 1 and the spaces spanned by W 
perfectly aligned, i.e. 


J 2 {W) < ^ ^ rank(VF) = vsiak{EE^) 


an immediate 
and EE^ are 

( 66 ) 


7 Experiments 

In this section we validate the proposed methodology by constructing models on standard datasets for region-based 
object classification, like Microsoft COCO [61], and for image segmentation on BSDS [44]. A matconvnet [62] 
implementation of our models and methods is publicly available. 

7.1 Region Classification on MSCOCO 

For recognition we use the MSCOCO dataset [61], which provides 880k segmented training instances across 80 
classes, divided into training and validation sets. The main goal is to assess our second-order pooling layer in various 
training settings. A secondary goal is to study the behavior of ConvNets learned from scratch on segmented training 
data. This has not been explored before in the context of deep learning because of the relatively small size of the 
datasets with associated object segmentations, such as PASCAL VOC [63]. 

The experiments in this section use the convolutional architecture component of AlexNet[2] with the global O 2 P 
layers we propose in order to obtain Deep02P models with both classification and fully connected (FC) layers in the 
same topology as Alexnet. We crop and resize each object bounding box to have 200 pixels on the largest side, then 
pad it to the standard AlexNet input size of 227x227 with small translation jittering, to limit over-fitting. We also ran¬ 
domly flip the images in each mini-batch horizontally, as in standard practice. Training is performed with stochastic 
gradient descent with momentum. We use the same batch size (100 images) for all methods but the learning rate was 
optimized for each model independently. All the Deep02P models used the same e = 10“^ parameter value in (4). 


Method 

SIFT-O 2 P 

AlexNet 

AlexNet (S) 

Deep02P 

Deep02P(S) 

Deep02P-FC 

Deep02P-FC(S) 

Results 

36.4 

25.3 

27.2 

28.6 

32.4 

25.2 

28.9 


Table 1: Classification error on the validation set of MSCOCO (lower is better). Models with (S) suffixes were trained 
from scratch (i.e. random initialization) on the MSCOCO dataset. The Deep02P models only use a classification 
layer on top of the Deep 02 P layer whereas the Deep 02 P-FC also have fully connected layers in the same topology 
as AlexNet. All parameters of our proposed global models are refined jointly, end-to-end, using the proposed matrix 
backpropagation. 

Architecture and Implementation details. Implementing the spectral layers efficiently is challenging since the GPU 
support for SVD is still very limited and our parallelization efforts even using the latest CUBA 7.0 solver API have 
delivered a slower implementation than the standard CPU-based. Consequently, we use CPU implementations and 
incur a penalty for moving data back and forth to the CPU. The numerical experiments revealed that an implementation 
in single precision obtained a significantly less accurate gradient for learning. Therefore all computations in our 
proposed layers, both in the forward and backward passes, are made in double precision. In experiments we still 
noticed a significant accuracy penalty due to inferior precision in all the other layers (above and below the structured 
ones), still computed in single precision, on the GPU. 

The second formal derivation of the non-linear spectral layer based on an eigen-decomposition of Z = E + el 
instead of SVD of F is also possible but our numerical experiments favor the formulation using SVD. The alternative 
implementation, which is formally correct, exhibits numerical instability in the derivative when multiple eigenvalues 
have very close values, thus producing blow up in K. Such numerical issues are expected to appear under some 
implementations, when complex layers like the ones presented here are integrated in deep network settings. 

Results. The results of the recognition experiment are presented in table 1. They show that our proposed Deep02P- 
FC models, containing global layers, outperform standard convolutional pipelines based on AlexNet, on this problem. 
The bottom layers are pre-trained on ImageNet using AlexNet, and this might not provide the ideal initial input 
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Arch 

[45] 

AlexNet 

VGG 

Layer 


ReLU-5 

ReLU-4 

ReLU-5 

Method 

NCuts 

NCuts 

DeepNCuts 

NCuts 

DeepNCuts 

NCuts 

DeepNCuts 

Results 

.55 (.44) 

.59 (.49) 

.65 (.56) 

.65 (.56) 

.73 (.62) 

.70 (.58) 

.74 (.63) 


Table 2: Segmentation results give best and average covering to the pool of ground truth segmentations on the 
BSDS300 dataset [44] (larger is better). We use as baselines the original normalized cuts [45] using intervening 
contour affinities as well as normalized cuts with affinities derived from non-hnetuned deep features in different lay¬ 
ers of AlexNet (ReLU-5 - the last local ReLU before the fully connected layers) and VGG (hrst layer in block 4 and 
the last one in block 5). Our DeepNCuts models are trained end-to-end, based on the proposed matrix backpropagation 
methodology, using the objective J 2 . 


features. However, despite this potentially unfavorable initialization, our model jointly rehnes all parameters (both 
convolutional, and corresponding to global layers), jointly, end to end, using a consistent cost function. 

We note that the fully connected layers on top of the Deep02P layer offer good performance benehts. O 2 P over 
hand-crafted SIFT performs considerably less well than our Deep02P models, suggesting that large potential gains 
can be achieved when deep features replace existing descriptors. 

7.2 Full-Image Segmentation on BSDS300 

We use the BSDS300 dataset to validate our deep normalized cuts approach. BSDS contains 200 training images and 
100 testing images and human annotations of all the relevant regions in the image. Although small by the standards 
of neural network learning it provides exactly the supervision we need to rehne our model using global information. 
Note that since the supervision is pixel-wise, the number of effective datapoint constraints is much larger. We eval¬ 
uate using the average and best covering metric under the Optimal Image Scale (OIS) criterion [44]. Given a set of 
full image segmentations computed for an image, selecting the one that maximizes the average and best covering, 
respectively, compared to the pool of ground truth segmentations. 

Architecture and Implementation details. We use both the AlexNet[2] and the VGG-16[3] architectures to feed our 
global layers. All the parameters of the deep global models (including the low-level features, pretrained on ImageNet) 
are rehned end-to-end. We use a linear affinity but we need all entries of W to be positive. Thus, we use ReLU layers 
to feed the segmentation ones. Initially, we just cascaded our segmentation layer to different layers in AlexNet but 
the resulting models were hard to learn. Our best results were obtained by adding two Conv-ReLU pairs initialized 
randomly before the normalized cuts layer. This results in many hlters in the lower layer (256 for AlexNet and 1024 
for VGG) for high capacity but few in the top layer (20 dimensions) to limit the maximal rank of W. For AlexNet 
we chose the last convolutional layer while for VGG we used both the hrst ReLU layer in block^ 4 and the top layer 
from block 5. This gives us feeds from layers with different invariances, receptive held sizes (32 vs. 132 pixels) 
and coarseness (block 4 has 2x the resolution of 5). We used an initial learning rate of 10“^ but 10 x larger rates 
for the newly initialized layers. A dropout layer between the last two layers with a rate of .25 reduces overhtting. In 
inference, we generate 8 segmentations by clustering[14] then connected components are split into separate segments. 

Results. The results in table 2 show that in all cases we obtain important performance improvements with respect 
to the corresponding models that perform inference directly on original AlexNet/VGG features. Training using our 
Matlab implementation takes 2 images/s considering 1 image per batch while testing at about 3 images/s on a 
standard Titan Z GPU with an 8 core E5506 CPU. In experiments we monitor both the objective and the rank of 
the similarity matrix. Rank reduction is usually a good indicator of performance in both training and testing. In the 
context of the rank analysis in §6, we interpret these hndings to mean that if the rank of the similarity is too large 
compared to the target, the objective is not sufficient to lead to rank reduction. However if the rank of the predicted 
similarity and the ground truth are initially not too far apart, then rank reduction (although not always rank matching) 
does occur and improves the results. 

8 Conclusions 

Motivated by the recent success of deep network architectures, in this work we have introduced the mathematical 
theory and the computational blocks that support the development of more complex models with layers that perform 
structured, global computations like segmentation or higher-order pooling. Central to our methodology is the devel¬ 
opment of the matrix backpropagation methodology which relies on the calculus of adjoint matrix variations. We 
provide detailed derivations, operating conditions for spectral and non-linear layers, and illustrate the methodology 
for normalized cuts and second-order pooling layers. Our region visual recognition and segmentation experiments 

®We call a block the set of layers between two pooling levels. 
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Figure 3: Segmentation results on images from the test set of BSDS300. We show on the first column the input 
image followed by a baseline (original parameters) and our DeepNcuts both using AlexNet ReLU-5. Two other pairs 
of baselines and DeepNCut models trained based on the J 2 objective follow. The first pair uses ReLU-4 and the 
second ReLU-5. The improvements obtained by learning are both quantitatively significant and easily visible on this 
side-by-side comparison. 
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based on MSCoco and BSDS show that deep networks relying on second-order pooling and normalized cuts layers, 
trained end-to-end using the introduced practice of matrix backpropagation, outperform counterparts that do not take 
advantage of such global layers. 
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Appendix 

8.1 Notation and Basic identities 

In this section we present for completeness the notation and some basic linear algebra identities that are useful in 
the calculations associated to matrix backpropagation and its instantiation for log-covariance descriptors [64, 58] and 
normalized cuts segmentation [7]. 

The following notation is used in the derivations 

• The symmetric part Asym = | {AJ -|- A) of a square matrix A. 

• The diagonal operator A^nag for an arbitrary matrix A € which is the m x n matrix which matches 

A on the main diagonal and is 0 elsewhere. Using the notations diag(A) and [cc] to denote the diagonal of A 
(taken as a vector) and the diagonal matrix with the vector x in the diagonal resp., then Adiag = [diag(A)]. 

• The colon-product A : B = ^ AijBij = Ti{A^B) for matrices A,B G and the associated Frobenius 

hi 

norm ||A|| := \/AA. 

• The Hadamard (element-wise) product A o B. 

We note the following properties of the matrix inner product : 


A-. B = A^ B^ = B A 

(67) 

A : (BC) = {B^A) : C = {AC^) : B 

( 68 ) 

A : Bdiag — Adiag • ^ 

(69) 

A . BgyjYi — Asym ■ T? 

(70) 

A-.{BoC) = {BoA)-.C 

(71) 

(Ai A 2 ) : (i?i i?2) = Ax ■. Bi -\- A 2 B 2 

(72) 
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